Single-cell sequencing uses tiny nanoliter droplets that are created using microfluidics. Ideally, each droplet contains one individual cell and one bead that has a molecular barcode, used to uniquely label transcripts from that cell.
If the volume of each droplet is 0.5 nanoliter, and cells are present at a constant concentration of 100 cells per microliter, what is the rate of cells included per droplet? (Hint: Remember to convert units!!!)
# Calculate the rate of the event (a cell being present in a droplet)
# Convert ul to nl: conc = (100/ul)/(1000/nl) = 0.1 cell / nl
cell.droplet.rate =
print(paste("Rate is", cell.droplet.rate, "cells per droplet"))
## Error in paste("Rate is", cell.droplet.rate, "cells per droplet"): object 'cell.droplet.rate' not found
What is the expected percentage of droplets that don’t have even a single cell?
print(paste0( , "% of droplets do not contain a cell."))
## Error in paste0(, "% of droplets do not contain a cell."): argument is missing, with no default
Typically, an experiment might start out with around one million cells. Is the Poisson distribution a good model for these data?
# your answer here
Plot the PMF for the Poisson distribution using this rate (the PMF is a PDF for a discrete distribution):
pois family of functions to generate an ideal Poisson PMF with the appropriate lambda parameter across a range of 0-10 cells.Cell_count and Density.ggplot to plot a histogram of the probability distribution of cell counts per droplet.
stat="identity" in the geom layerscale_x_continuous(breaks=pretty_breaks()) (pretty breaks is from the scales package)# ideal probability mass function
range = c(0:10)
cell.droplet.pmf =
# Check the sum of the PMF
# Create the data frame
# Plot the distribution using ggplot
## Error: <text>:13:0: unexpected end of input
## 11: # Plot the distribution using ggplot
## 12:
## ^
The experimentalist can control the concentration of cells in this experiment. Plot the distribution of cell counts for experiments with 100, 1000, or 2000 cells/microliter.
Generate PDF data for each concentration of cells/ul
Make a data frame containing the distributions
Convert the data frame to long format before plotting
melt from the reshape package:
Use ggplot2 to make a histogram showing all three distributions:
stat="identity" in the histogram layerposition="dodge2" to position the bars next to each other instead of superimposed# original rate per droplet (for 100 cells/ul)
mu = cell.droplet.rate
mu
# generate the PDFs
cell.droplet.pmf.n100 = # 100
cell.droplet.pmf.n1000 = # 1k
cell.droplet.pmf.n2000 = # 2k
# Create a data frame with one column for the range vector and one for each dataset
# look at the first few lines of the data.frame
# Melt the dataframe for ggplot to convert the data to "long" format using melt
# This allows us to group the data by Concentration for plotting
# Specify the column names for id.vars, value.name, variable.name
# look at the first few lines of the melted data frame
# Make the plot
## Error: <text>:27:0: unexpected end of input
## 25:
## 26:
## ^
Assuming experiment produces one million (1e6) droplets, convert the density plot above to a histogram showing the number of droplets per experiment that contain 0-10 cells per droplet for each concentration. Add a label to the y-axis indicating that it now represents the number of droplets instead of the density.
# Make the plot
To make the next few steps easier, now go ahead and add a column to your melted data frame containing the number of droplets out of 1e6 corresponding to the density for each datapoint. Check the sum of the droplet counts for each concentration to make sure this column adds up correctly.
# add a column for cell counts out of 1e6
# check sum of counts
Select just the rows with 1 cell, and then plot the number of droplets that are expected to contain only one cell for each concentration. Instead of using geom_histogram(), try using geom_col().
# filter the data
# Plot the data as a bar plot
Which concentration produces the largest number of droplets with one cell?
# your answer here
The optimal cell concentration maximizes the number of droplets with a single cell while at theh same time minimizing the rate of ‘doublets’, which have 2 or more cells in a single droplet.
Plot the number of droplets that have 1 cell and the number that have 2+ cells for each experimental concentration. To do this, we suggest the following steps (but you can do this any way you want):
aggregate function to do this, using the formula . ~ Concentration.Whatever way you choose to do this, make sure to comment your code!
# filter the data
# sum up the totals by group using the aggregate function
# relabel the cell counts as "2+"
# combine the 1 and 2+ data
# Plot the combined data as a bar plot
Which experimental concentration has the lowest doublet rate? Which concentration would you choose for your experiment?
# your answer here
Above, we simulated idealized data for this experiment for the purposes of illustration. If we could optically measure the number of cells per droplet in an actual experiment, then we could see whether the data actually fit the expected distribution.
Let’s say that you’ve used a cell counter at the start of the experiment to estimate the concentration of cells. You targeted to get a concentration of 100 cells per microliter, but the actual concentration was 110 cells per microliter.
Now you want to see how well the actual distribution of cells per droplet will match the expected distribution.
xtabs() to make a table of the expected number of cells per droplet for the ideal n100 distribution.
Expected_droplets ~ Cell_count here.# sample 1M droplets from the Poisson distribution
# make a frequency table from the sampled data
# filter the n100 data from the ideal distributions
# use xtabs to make a table of the ideal n100 data
Since the sampled data will have only up to 3 or 4 cells per droplet, you will notice that the two tables are not the same size! To generate the table we need for the Chi-square test, you will need to truncate the “ideal” table to match the length of the sampled data.
Go ahead and this, and then join them together into a single contingency table. You should end up with a table containing two rows and around 3-4 columns.
# truncate the table with the ideal data to the correct length
# bind the data together into a single table
Plot the distribution of cell counts per droplet for the observed and expected data. How different do they look?
# make a bar plot to compare these visually
Finally, perform a Chi-squared test of the observed vs. expected cell counts.
# do the chi-squared test
How do your sampled data compare to the ideal data in the Chi-squared test?
# your answer here
Over the last few years continuous improvements in single-cell technology have been made to increase the proportion of droplets containing a single cell as well as throughput. Alternative methods such as split-pooling have also been developed that can be scaled to accommodate very large samples, and a variety of methods have been developed to allow multiplexing of multiple samples.
In a previous homework, we looked at the mouse high-fat diet dataset and computed confidence intervals for multiple samples of 12 mice from the control population. Since we didn’t know about the \(t\)-distribution, we just used a \(z\)-distribution to calculate confidence intervals.
You may or may not have noticed that when you re-ran the code for finding the 95% CI for samples from the control population, that usually slightly more than 5 samples out of 100 did not contain the true population mean. This is because 95%CIs for the \(z\)-distribution were too narrow and did not take into account the variability of small samples!
The dataset attached to this homework contains the cleaned dataset we made in that homework. First, load the dataset and create a vector containing just the weights for the male control population.
# load the dataset
mice.pheno =
# subset the bodyweights of the **male** mice fed the **chow** diet
chow =
## Error: <text>:6:0: unexpected end of input
## 4: # subset the bodyweights of the **male** mice fed the **chow** diet
## 5: chow =
## ^
Below is reproduced the code from that homework for visualizing 100 95% confidence intervals, using just the chow data:
# ============================================================================ #
n.samples = 100 # number of random samples
N = 12 # sample size
Q = qnorm(0.975) # z-score for +/- 47.5% of a normal distribution
# set up a plot showing a vertical line with the true mean for each population
# rename chow and hf to reflect that we are using these as the population
chowPop = chow
## Error in eval(expr, envir, enclos): object 'chow' not found
plot(mean(chowPop)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,n.samples))
## Error in mean(chowPop): object 'chowPop' not found
abline(v=mean(chowPop))
## Error in mean(chowPop): object 'chowPop' not found
# display the 95% CI for a bunch of random samples
#ci_range = c() # initialize empty vector
for (i in 1:n.samples) {
chow.sample = sample(chowPop,N) # take a random sample
se = sd(chow.sample)/sqrt(N) # compute SEM
interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
# ci_range[i] = interval[2] - interval[1] # record size of ci interval
# draw the 95% CI -- color it green if it contains the true population mean, or red if not
covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
color = ifelse(covered,"forestgreen","red")
lines(interval, c(i,i), col=color)
}
## Error in sample(chowPop, N): object 'chowPop' not found
#summary(ci_range) # display summary of the distribution of CI ranges
# ============================================================================ #
Re-run the above code block a bunch of times. Do you notice that usually more than 5 CI’s don’t contain the true mean?
Now turn the for loop above into a function called ci.fail.counter. Instead of having the code add a line to a graph, modify it so that all it does is return the number of CI’s out of 100 that DO NOT contain the true population mean.
The function should take the following parameters:
chowPop: Data for an entire population (vector)N: the sample size (default = 12)n.samples: the number of samples to take (default = 100)Note:
Q inside the function so it doesn’t depend on any other outside information.ci.fail.counter = function( ... ) {
# initialize variables
counter = # create a counter
Q = # z-score for +/- 47.5% of a normal distribution
for (i in 1:n.samples) {
chow.sample = sample(chowPop,N) # take a random sample
se = sd(chow.sample)/sqrt(N) # compute SEM
interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
# flag the 95% CI as either covering or not covering the true population mean
covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
# add to counter if CI doesn't cover the population mean
if ( ... ) { ... }
}
# return the total number of CI's that don't contain the population mean
return( ... )
}
Below, test your function by calling it once.
When everything looks ok, run the function 100 times and record the number of CI’s that don’t cover the true population mean each time.
Check the results by making a table of the counts.
# call the function once
ci.fail.counter(chowPop = chowPop)
# initialize an empty vector to hold the counts
ci.fail.count =
# run the function 100 times and fill up the fail count vector with the results from each run
# take a peek at the first few rows of the vector
# make a table to check the results
## Error: <text>:14:0: unexpected end of input
## 12: # make a table to check the results
## 13:
## ^
Make a quick plot of the number of 95%CIs per 100 CI’s that don’t contain the true population mean. How does the distribution look? What are the mean and median failed CI’s using the \(z\)-distribution?
# histogram of failed 95% CI's per 100 CI's
Copy your function above and modify it so that instead of using the \(z\)-distribution, it uses the 95% quantiles for a \(t\)-distribution with the right number of d.f.
# modified function using Q for t-distribution
Repeat part (b) using the new version of your function.
Repeat part (c) using the new data based on the \(t\)-distribution.
How do these results look in comparison to your earlier results?
# your answer here